1. Setup

In this notebook we use the following R packages: tidyverse and ggplot2. In addition we use the multiplot functionality provided by Cookbook for R. In Python we use fastai, numpy and pandas. Full code for training and inference available at GitHub. All code for statistical analysis is included in this document.

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'dplyr' was built under R version 3.4.2
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats

2. Introduction

The success of deep learning models has typically been measured in terms of their predictive power, but they have lacked a principled way of expressing uncertainty about these predictions.

In my master’s thesis we apply recent insights connecting dropout neural networks to Bayesian approximate variational inference (VI). VI is technique for approximating intractable integrals that arise when modelling the posterior distribution in Bayesian inference and machine learning. Gal et. al. [1, 2, 3] have shown that the posterior distribution of a NN can be approximated by leaving dropout on at test time and sampling multiple predictions. This amounts to drawing Monte Carlo (MC) samples from the posterior (a brief description of this process is listed in section 1.1). The Bayesian framework allows us to obtain principled uncertainty estimates when making predictions in these so called MC dropout NNs.

The results of [1,2,3] seem particularly promising when applied to healthcare. In fact, a recent paper published in Nature [4] applies MC dropout to capture uncertainty estimates when predicting the presence of diabetic retinopathy in patients. In this notebook I will first briefly introduce the general idea behind MC dropout. Then we will apply MC dropout in a classification task based on the collection of labelled images in the CIFAR-10 data set.

2.1 Background

A neural network is made up of many neurons which are organized in layers. Neurons are often called nodes or units, depending on your choice of machine learning literature. Henceforth we will refer to them as units.

In a typical neural network there are many, many units. As a consequence the number of parameters greatly exceeds the number of data points, dramatically increasing the risk of overfitting (i.e. there are many settings of weights which will cause the network to fit the data almost perfectly).

Dropout is a common stochastic regularisation technique [5] that is used to prevent overfitting in neural networks. The term dropout refers to randomly “dropping out” nonoutput units, temporarily removing all connections to the rest of the network. The main idea is that if the presence of other units is unreliable, each unit is forced to learn how to be useful on its own. At test time all units are activated, and the weights of the network are scaled by the dropout rate in order to match the expected output during training.

Recent work by Gal et. al. [1, 2, 3] casts dropout neural networks as approximate Bayesian inference. Their results show that the predictive posterior distribution of a neural network can be approximated by leaving dropout on at test time.

Consider a classification setting, such as in CIFAR-10. Essentially, what happens when applying MC dropout is the following:

Mathematically, model uncertainy is approximated the empirical standard deviation of the predictions for class \(k\), i.e. \[\hat{\sigma}_k = \sqrt{\frac{\sum_{t=1}^T{[p_t(k|X,w) - \hat{\mu}_k}]^2}{T-1}}\] where \[\hat{\mu}_k = \frac{1}{T}\sum_{t=1}^T p_t(k|X,w)\] is the averaged softmax outputs of the predicted class.

Gal et. al. [3] show that the above amounts to drawing Monte Carlo samples from the predictive posterior. Their work demonstrates that applying dropout is effectively the same as defining a Bayesian neural network with a Bernoulli approximating prior over the parameters. Gal has written an excellent blog post that introduces the work and the derivation.

2.2 Problem statement

The derivation in Gal et. al. [2] is based on the observation that a shallow neural network with infinitely many weights converges to a Gaussian process (GP) with a specific covariance function. A GP is a non-parametric, probabilistic machine learning method. The idea is that we place a prior distribution over the function space, and by observing new data points we can figure out which function is most likely to have generated the observed data. A GP is fully specified by its mean and covariance functions, and allows us to obtain uncertainty estimates [1]. The extension of this to dropout neural networks is the main result of Gal et. al. [2].

In [1] however, Gal et. al. extend this idea further to include priors over the weights of the convolutional layers in a convolutional neural network (CNN). A CNN is a specialized type of neural network, used primarily and very effectively for image analysis. The authors briefly state that the GP interpretation is lost when the model is extended to CNNs, but that these networks still can be “modelled as Bayesian”.

As far as we are aware, there have been no efforts to determine the correlation between the approximated uncertainty and a network’s ability to predict correctly. Moreover, the work done so far seems to focus on the uncertainty associated with the predicted class. We will examine if there is any additional information to be gained from establishing a connection between the prediction and the runner-up prediction. In other words, our problem statement is:

Are the uncertainty approximations obtained by applying Monte Carlo dropout to convolutional neural networks a reasonable measure of model uncertainty?

2.3 Experimental setup

State-of-the-art architectures such as ResNet and DenseNet are very powerful, but they are also complicated and their inner workings are quite convoluted. We are primarily interested in examining the correlation between uncertainty estimation and predictive capabilities. It is arguably better to use a simple network architecture to illustrate the idea of MC dropout. In our approach we use LeNet-5.

LeNet-5 was a pioneering 7-layer convolutional neural network originally developed by Yann LeCunn in 1998 for handwritten digit recognition. It is hopelessly primitive compared to contemporary architectures, but still captures the gist of what a convolutional network is while remaining simple enough to allow us to understand every building block of the network. The following chunk shows the model architecture:

class lenet_all(nn.Module):
    def __init__(self, conv_size=conv_size, pool_size=2, drop_rate=p):
        super().__init__()
        self.drop_rate = drop_rate
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=192, kernel_size=conv_size)
        self.dropmc1 = DropoutMC(p)
        self.pool1 = nn.MaxPool2d(kernel_size=pool_size, stride=2)
        self.conv2 = nn.Conv2d(in_channels=192, out_channels=192, kernel_size=conv_size, padding=2)
        self.dropmc2 = DropoutMC(p)
        self.pool2 = nn.MaxPool2d(kernel_size=pool_size, stride=2)
        self.dense1 = nn.Linear(in_features=7*7*192, out_features=1000)
        self.dropmc3 = DropoutMC(p)
        self.dense2 = nn.Linear(in_features=1000, out_features=10)
        
    def forward(self, x):
        x = self.dropmc1(self.conv1(x))
        x = self.pool1(x)
        x = self.dropmc2(self.conv2(x))
        x = self.pool2(x)
        x = x.view(x.size(0), -1)
        x = self.dropmc3(F.relu(self.dense1(x)))       
        x = self.dense2(x)
        
        return x


The above specification of LeNet-5 differs from the orginial in one crucial way: We use Monte Carlo dropout layers. MC dropout is not a feature that is implemented in PyTorch, and we must therefore implemenet one ourselves. Fortunately, this amounts to a simple adjustment of existing code. We modify the Dropout class to take an additional argument called dropoutMC with default value set to True. :

class DropoutMC(nn.Module):
    r"""
    Modified version of Dropout from torch/nn/modules/dropout.py
    Args:
        p: probability of an element to be zeroed. Default: 0.5
        dropoutMC: If set to ``True``, dropout is turned on at test time. Default: ``True`
        inplace: If set to ``True``, will do this operation in-place. Default: ``False``
    Shape:
        - Input: `Any`. Input can be of any shape
        - Output: `Same`. Output is of the same shape as input
    Examples::
        >>> m = nn.Dropout(p=0.2)
        >>> input = autograd.Variable(torch.randn(20, 16))
        >>> output = m(input)
    .. _Improving neural networks by preventing co-adaptation of feature
        detectors: https://arxiv.org/abs/1207.0580
    """

    def __init__(self, p=0.5, dropoutMC=True, inplace=False):
        super(DropoutMC, self).__init__()
        if p < 0 or p > 1:
            raise ValueError("dropout probability has to be between 0 and 1, "
                             "but got {}".format(p))
        self.p = p
        self.dropoutMC = dropoutMC
        self.inplace = inplace

    def forward(self, input):
        return F.dropout(input, self.p, self.dropoutMC, self.inplace)

    def __repr__(self):
        inplace_str = ', inplace' if self.inplace else ''
        return self.__class__.__name__ + '(' \
            + 'p=' + str(self.p) \
            + inplace_str + ')'


We need to define a function that performs inference over out input. The following chunk contains the relevant code for the sampling procedure described in section 1.1:

def inference(learner, data, T=100):
    ''' Function that gathers all relevant numerical results from MC dropout over T iterations.
        
        Arguments:
        learner, fastai learner object
        data, fastai dataloader
        T, number of stochastic forward passes
    '''
    # Get images, labels and filenames
    imgs, labels = next(iter(data.val_dl))
    fnames = data.val_ds.fnames

    # Empty dictionary to store all output
    output = {}

    # Empty array to store results in
    results = np.empty((T, num_classes))

    # iterator index to keep in dictionary
    k=0

    for (img, label, fname) in list(zip(imgs, labels, fnames)):

        for i in range(T):
            prediction = learner.predict_array(img[None])
            results[i] = prediction

        probs = to_np(F.softmax(V(results)))
        probs_mean = np.mean(probs, axis=0)
        pred_std = np.std(probs, axis=0)

        prediction = probs_mean.argmax()
        uncertainty = pred_std[prediction]

        correct = 1 if prediction == label else 0

        output[k] = {"img": fname, "softmax_dist": probs, "probs": probs_mean, "prediction": prediction, "truth": label, "uncertainty": uncertainty, "correct": correct}
        k+=1
    
    return output


The function inference stores all the relevant statistics and softmax distributions in a dictionary named output. The results are then turned into a pandas dataframe and some very basic feature engineering is performed. Finally, the data is prepared for statistical analysis in R. The relevant code can be found on GitHub.

2.4 Models

We will examine data gathered from four variants of LeNet-5. All models were trained on CIFAR-10 using the fastai API. CIFAR-10 contains 60.000 labelled 32x32x3 color images belonging to 10 different classes. The input data was split into a training set of 50.000 images and a test set of 10.000 images. The training set was further split into a training set and a validation set. All models have weight_decay = 0.0005 and all learning rates were chosen using lr_find:

Note that model55 has a slightly better baseline performance than the other models.

2.5 Data

The data contains the following variables after it has been prepared for analysis in R:

PUT THIS IN AFTER PLOT, EDIT IF NECESSARY Intuitively, if diff is large, the averaged models all agree that class \(k\) is the correct prediction. If diff is small, the models sampled by MC dropout don’t agree on a single class. Thus diff also serves as a proxy for uncertainty. Model uncertainty, however, is approximated by the empirical standard deviation of the predictions for class \(k\). Thus diff_sd_ratio is expressed by \[\tau_{kj} = \frac{\hat{\mu}_k - \hat{\mu}_j}{\hat{\sigma}_k}\] where \(j\) is the runner-up prediction. \(\tau_{jk}\) gives us a ratio of two different measures of uncertainty. Consider the following cases:

3. Exploratory analysis

This section will be divided into to parts. First, we will examine the resulting data from model55, which will be regarded as our baseline model. Next, we will analyse the data from models obtained by varying kernel sizes and dropout rates.

3.1 Uncertainty analysis of baseline model

# Importing data
data <- as.tibble(read.csv("~/Documents/Masteroppgave/Data/Resultater/lenet-model55.csv"))
df <- select(data, -X)

# Summarizing entire data set
summary(df)
##     correct         prediction        truth      uncertainty       
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0   Min.   :0.0000029  
##  1st Qu.:1.0000   1st Qu.:2.000   1st Qu.:2.0   1st Qu.:0.0400673  
##  Median :1.0000   Median :5.000   Median :4.5   Median :0.1274580  
##  Mean   :0.7916   Mean   :4.561   Mean   :4.5   Mean   :0.1184386  
##  3rd Qu.:1.0000   3rd Qu.:7.000   3rd Qu.:7.0   3rd Qu.:0.1831702  
##  Max.   :1.0000   Max.   :9.000   Max.   :9.0   Max.   :0.3526330  
##      prob1            prob2               class2           diff          
##  Min.   :0.1835   Min.   :0.0000004   Min.   :0.000   Min.   :0.0000124  
##  1st Qu.:0.5814   1st Qu.:0.0188524   1st Qu.:2.000   1st Qu.:0.3340403  
##  Median :0.8281   Median :0.1021314   Median :4.000   Median :0.7213734  
##  Mean   :0.7657   Mean   :0.1343779   Mean   :4.082   Mean   :0.6312929  
##  3rd Qu.:0.9700   3rd Qu.:0.2257584   3rd Qu.:6.000   3rd Qu.:0.9503560  
##  Max.   :1.0000   Max.   :0.4989194   Max.   :9.000   Max.   :0.9999990  
##  diff_sd_ratio       logit_prob1     
##  Min.   :     0.0   Min.   :-1.4929  
##  1st Qu.:     1.7   1st Qu.: 0.3284  
##  Median :     5.4   Median : 1.5722  
##  Mean   :   208.5   Mean   : 2.1424  
##  3rd Qu.:    23.5   3rd Qu.: 3.4750  
##  Max.   :340800.1   Max.   :14.3329
# Summarizing incorrect predictions
df %>% 
  filter(correct==0) %>% 
  summary()
##     correct    prediction        truth        uncertainty      
##  Min.   :0   Min.   :0.000   Min.   :0.000   Min.   :0.001013  
##  1st Qu.:0   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:0.153040  
##  Median :0   Median :4.000   Median :4.000   Median :0.181794  
##  Mean   :0   Mean   :4.342   Mean   :4.049   Mean   :0.180244  
##  3rd Qu.:0   3rd Qu.:6.000   3rd Qu.:5.000   3rd Qu.:0.210907  
##  Max.   :0   Max.   :9.000   Max.   :9.000   Max.   :0.333903  
##      prob1            prob2              class2           diff          
##  Min.   :0.1835   Min.   :0.000185   Min.   :0.000   Min.   :0.0000124  
##  1st Qu.:0.4103   1st Qu.:0.173676   1st Qu.:2.000   1st Qu.:0.0959377  
##  Median :0.5264   Median :0.239908   Median :4.000   Median :0.2403054  
##  Mean   :0.5451   Mean   :0.242866   Mean   :4.038   Mean   :0.3022316  
##  3rd Qu.:0.6602   3rd Qu.:0.312481   3rd Qu.:6.000   3rd Qu.:0.4645013  
##  Max.   :0.9994   Max.   :0.498919   Max.   :9.000   Max.   :0.9992357  
##  diff_sd_ratio       logit_prob1     
##  Min.   :  0.0001   Min.   :-1.4929  
##  1st Qu.:  0.5130   1st Qu.:-0.3629  
##  Median :  1.2146   Median : 0.1055  
##  Mean   :  2.8113   Mean   : 0.2501  
##  3rd Qu.:  2.4979   3rd Qu.: 0.6642  
##  Max.   :986.4482   Max.   : 7.4530
# Summarizing correct predictions
df %>% 
  filter(correct==1) %>% 
  summary()
##     correct    prediction        truth        uncertainty       
##  Min.   :1   Min.   :0.000   Min.   :0.000   Min.   :0.0000029  
##  1st Qu.:1   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:0.0263485  
##  Median :1   Median :5.000   Median :5.000   Median :0.0965002  
##  Mean   :1   Mean   :4.619   Mean   :4.619   Mean   :0.1021673  
##  3rd Qu.:1   3rd Qu.:7.000   3rd Qu.:7.000   3rd Qu.:0.1672548  
##  Max.   :1   Max.   :9.000   Max.   :9.000   Max.   :0.3526330  
##      prob1            prob2               class2           diff          
##  Min.   :0.1863   Min.   :0.0000004   Min.   :0.000   Min.   :0.0007845  
##  1st Qu.:0.7009   1st Qu.:0.0114910   1st Qu.:2.000   1st Qu.:0.5144494  
##  Median :0.9002   Median :0.0618456   Median :4.000   Median :0.8352473  
##  Mean   :0.8237   Mean   :0.1058168   Mean   :4.094   Mean   :0.7179230  
##  3rd Qu.:0.9820   3rd Qu.:0.1758222   3rd Qu.:7.000   3rd Qu.:0.9697941  
##  Max.   :1.0000   Max.   :0.4886857   Max.   :9.000   Max.   :0.9999990  
##  diff_sd_ratio       logit_prob1     
##  Min.   :     0.0   Min.   :-1.4742  
##  1st Qu.:     2.8   1st Qu.: 0.8516  
##  Median :     8.6   Median : 2.1997  
##  Mean   :   262.7   Mean   : 2.6406  
##  3rd Qu.:    36.9   3rd Qu.: 3.9965  
##  Max.   :340800.1   Max.   :14.3329
# Aggregating summary statistics by correct/incorrect
agg_df <- df %>% 
  group_by(correct) %>% 
  summarise(n=n(),
          mean_uncertainty=mean(uncertainty), 
          sd_uncertainty=sd(uncertainty), 
          mean_diff=mean(diff),
          sd_diff=sd(diff),
          mean_ratio=mean(diff_sd_ratio),
          sd_ratio=sd(diff_sd_ratio))
agg_df
  • The model has incorrectly classified 2084 images. The mean uncertainty estimate for the incorrect predictions is \(\sigma_{incorrect} =\) 0.1802444. The standard deviation of the uncertainty estimate is 0.0461661.

  • The model has correctly classified 7916 images. The mean estimated uncertainty is \(\sigma_{correct} =\) 0.1021673. The standard deviation of the uncertainty estimates in the correctly classified cases is 0.0780871.

3.1.1 Distribution of uncertainty

In the following we will visualize the relationships between our variabels. We start by examining the empirical distribution of the uncertainty estimates \(\hat{\sigma}_k\):

# Distribution of estimated uncertainty
p1 <- df %>% 
  ggplot(aes(x=uncertainty)) +
  geom_histogram(col="grey", bins = 50, alpha=.5) +
  ggtitle("Distribution of estimated uncertainty")
p1

The distribution appears to be bimodal. By grouping the uncertainty estimates by correct (i.e. if the label was correctly predicted or not), we can find out how the predictions contribute to the uncertainty distribution:

#Distribution of estimated uncertainty by prediction
p2 <- df %>% 
  ggplot(aes(x=uncertainty, col=as.factor(correct))) +
  geom_freqpoly(alpha=.7) +
  ggtitle("Distribution of estimated uncertainty by classification") +
  scale_color_discrete(name="Prediction",
                         breaks=c("0", "1"),
                         labels=c("0: Incorrect", "1: Correct"))
p2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The blue line corresponds to the correct predictions, the red line corresponds to incorrect predictions. We see that the incorrect predictions are centered around a higher associated uncertainty, whereas far more of the correctly predicted classes are concentrated around a low uncertainty value. The incorrect classifications greatly contribute to the bimodality, but it is also present in the distribution of uncertainty for the correct classifications. The following kernel density estimate plot (using a Gaussian kernel) gives us an idea of how the distributions compare to eachother:

# KDE by correct prediction
p3 <- df %>% 
  ggplot(aes(x=uncertainty)) +
  geom_density(data=subset(df, correct==0), fill="red", alpha=I(.2)) +
  geom_density(data=subset(df, correct==1), fill="turquoise", alpha=I(.2)) +
  ggtitle("Kernel density estimates of uncertainty distribution")
p3

The boxplot gives us yet another way to visualize the difference between uncertainty distributions by predictive ability:

# Boxplot of uncertainties for correct vs. incorrect
p4 <- df %>% 
  ggplot(aes(x=as.factor(correct), y=uncertainty)) +
  geom_boxplot(aes(fill=as.factor(correct)), alpha=.7) +
  labs(x="correct") +
  ggtitle("Boxplot of uncertainty distribution by correct/incorrect") +
  scale_fill_discrete(name="Prediction",
                         breaks=c("0", "1"),
                         labels=c("0: Incorrect", "1: Correct"))
p4

3.1.2 Relationship to other variables

We may also be interested in the relationship between uncertainty and other variables. First, we plot uncertainty against prob1 (the prediction’s softmax output):

# Plotting softmax output of prediction against estimated uncertainty
p5 <- df %>% 
  ggplot(aes(x=prob1, y=uncertainty)) +
  geom_point(alpha=.2) +
  ggtitle("Uncertainty vs. softmax output of prediction for all observations") +
  labs(x="softmax output of predicted class") +
  scale_color_distiller(name="Softmax output",
                        palette = "Spectral")
p5

The softmax outputs in the above plot are colour graded. Outputs close to 1 are red, outputs close to 0 are blue. The plot shows a clear parabolic shape. We can obtain more information by colouring the points by the value of the runner-up predictions:

# Plotting softmax output of prediction against estimated uncertainty, coloured by softmax output of runner-up
p5 <- df %>% 
  ggplot(aes(x=prob1, y=uncertainty, col=prob2)) +
  geom_point(alpha=.5) +
  ggtitle("Uncertainty vs. softmax output of prediction for all observations") +
  labs(x="softmax output of predicted class") +
  scale_color_distiller(name="Runner up",
                        palette = "Spectral")
p5

This plot is particularly interesting. The softmax outputs of the runner-up classes \(\hat{\mu}_j\) in the above plot are colour graded. \(\hat{\mu}_j \approx 0.5\) are red, \(\hat{\mu}_j \approx 0\) are blue. We see a clear concentration of red points in the area where the probability of the predictied class \(\hat{\mu}_k \approx 0.5\). If the softmax predictions of both the predicted class and the runner-up are close 0.5, then we have a situation analogous to maximum entropy. This points coincide with the largest approximated uncertainty values.

In the following plot we split the observations by incorrect/correct predictions:

# Plotting uncertainty vs. softmax output of prediction
p6 <- df %>% 
  ggplot(aes(x=prob1, y=uncertainty)) +
  geom_point(aes(col=prob2),alpha=0.5) +
  geom_density_2d(col="black", alpha=.3) +
  ggtitle("Uncertainty vs. softmax output of prediction by incorrect/correct") +
  labs(x="softmax output of prediction") +
  facet_grid(.~as.factor(correct)) +
  scale_color_distiller(name="Runner up",
                        palette = "Spectral")
p6

The black contour lines indicate where most of the points are concentrated. The plot on the left is for incorrect predictions. The right hand plot represents the correct predictions. For the correct predictions, it seems as if far more of the points are concentrated around high predicted output/low runner-up output/low uncertainty. This is not surprising, considering 75% of the correct predictions have a softmax value of approximately 0.7 or above. For the incorrect classifications, most of the points are concentrated around the area of maximum entropy. This indicates that the approximated uncertainty estimates indeed contain valuable information in the incorrect cases.

The following plot shows the relationship between uncertainty estimates and the softmax output of the runner-up prediction. Unsurprisingly, model uncertainty increases as the softmax output of the runner-up increases. We have plotted a LOESS estimate of the mean uncertainty as a function of the runner-up output to make this clearer:

# Plotting softmax output of runner-up against estimated uncertainty
p7 <- df %>% 
  ggplot(aes(x=prob2, y=uncertainty)) +
  geom_point(alpha=.2) +
  geom_smooth(method="loess") +
  labs(x="softmax output of runner-up") +
  ggtitle("Uncertainty vs. softmax output of runner-up")
p7

3.1.3 Incorrectly classified images associated with high uncertainty

Let us take a look at the five most uncertain, incorrectly classified images:

For the most ucertain images the runner-up suggestion is non-sensical. But what would happen to our overall accuracy if we used the runner-up predictions for all incorrect classifications?

# Checking if runner-up is equal to groung truth
df %>% 
  filter(correct==0) %>% 
  mutate(correct=if_else(class2==truth,1,0)) %>% 
  count(correct)

Accuracy would rise to 0.9079. This indicates that there may be some valuable information to be gathered from the runner-up.

3.1.4 Correctly classified images associated with high uncertainty

In the following we look at the five most certain, incorrectly classified images:

3.1.3 Uncertainty-prediction correlation

Logistic regression gives us a simple way of testing if the approximated uncertainty is a useful predictor of the model’s ability to predict correctly.

# Fitting model on full data set to check significance of predictor
model <- glm(as.factor(correct)~uncertainty, data=df, family = binomial(link="logit"))
summary(model)
## 
## Call:
## glm(formula = as.factor(correct) ~ uncertainty, family = binomial(link = "logit"), 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6704   0.2409   0.3588   0.7100   2.0114  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   3.55249    0.07625   46.59   <2e-16 ***
## uncertainty -15.40803    0.43250  -35.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10236.6  on 9999  degrees of freedom
## Residual deviance:  8467.6  on 9998  degrees of freedom
## AIC: 8471.6
## 
## Number of Fisher Scoring iterations: 5

3.2 Comparison of all models

We gather the data from all models in a single dataframe df:

ROOT <- "~/Documents/Masteroppgave/Data/Resultater/"

# Kernel size 3, drop rate = .2
df32 <- process(ROOT, "lenet-model32.csv", "model32", 3, .2)
# Kernel size 3, drop rate = .5
df35 <- process(ROOT, "lenet-model35.csv", "model35", 3, .5)
# Kernel size 5, drop rate = .5
df55 <- process(ROOT, "lenet-model55.csv", "model55", 5, .5)
# Kernel size 5, drop rate = .2
df52 <- process(ROOT, "lenet-model52.csv", "model52", 5, .2)

# Combining all dataframes and adding kernel size, dropout rate
df <- df32 %>% 
  bind_rows(df35) %>% 
  bind_rows(df52) %>% 
  bind_rows(df55) %>% 
  mutate(correct=as.factor(as.logical(correct)),
         kernel=as.factor(as.character(kernel)),
         dropout=as.factor(as.character(dropout)))
head(df)
summary(df)
##   correct        prediction        truth      uncertainty       
##  FALSE: 9160   Min.   :0.000   Min.   :0.0   Min.   :0.0000013  
##  TRUE :30840   1st Qu.:2.000   1st Qu.:2.0   1st Qu.:0.0367498  
##                Median :5.000   Median :4.5   Median :0.1041035  
##                Mean   :4.568   Mean   :4.5   Mean   :0.1025274  
##                3rd Qu.:7.000   3rd Qu.:7.0   3rd Qu.:0.1551265  
##                Max.   :9.000   Max.   :9.0   Max.   :0.3526330  
##      prob1            prob2               class2           diff          
##  Min.   :0.1694   Min.   :0.0000004   Min.   :0.000   Min.   :0.0000124  
##  1st Qu.:0.5685   1st Qu.:0.0234855   1st Qu.:2.000   1st Qu.:0.3084196  
##  Median :0.8069   Median :0.1119699   Median :4.000   Median :0.6873797  
##  Mean   :0.7530   Mean   :0.1398601   Mean   :4.098   Mean   :0.6131886  
##  3rd Qu.:0.9624   3rd Qu.:0.2325295   3rd Qu.:6.000   3rd Qu.:0.9374409  
##  Max.   :1.0000   Max.   :0.4989194   Max.   :9.000   Max.   :0.9999992  
##  diff_sd_ratio       logit_prob1         model           kernel   
##  Min.   :     0.0   Min.   :-1.5899   Length:40000       3:20000  
##  1st Qu.:     1.9   1st Qu.: 0.2757   Class :character   5:20000  
##  Median :     5.8   Median : 1.4300   Mode  :character            
##  Mean   :   242.5   Mean   : 1.9877                               
##  3rd Qu.:    25.3   3rd Qu.: 3.2416                               
##  Max.   :798986.9   Max.   :15.0261                               
##  dropout    
##  0.2:20000  
##  0.5:20000  
##             
##             
##             
## 

Note that we have added two new variables to the data set: kernel (factor, convolution size) and dropout (factor, dropout rate). In the following chunk we generate some summary statistics:

# Summary stats for all models
total_uncertainty <- df %>% 
  group_by(model) %>% 
  summarise(accuracy=sum(as.numeric(correct)-1)/10000,
          mean_uncertainty=mean(uncertainty),
          sd_uncertainty=sd(uncertainty), 
          mean_diff=mean(diff),
          sd_diff=sd(diff),
          mean_ratio=mean(diff_sd_ratio),
          sd_ratio=sd(diff_sd_ratio)) %>% 
  dplyr::arrange(mean_uncertainty)
total_uncertainty

Overall:

  • model52 is the most certain.

  • model35 is the least certain.

  • It seems as though models trained with a higher dropout rate are more accurate, but less certain.

  • The mean differences between the predicted and runner-up class are close to 0.6 for all models.

  • The models with the lowest mean uncertainty correspond with large mean values of \(\tau_{jk}\).

  • All models show an increase in prediction accuracy as compared to their baseline accuracy from training (when no stochastic forward passes were performed):
    • model52: Accuracy has increased by 0.01514 percent.
    • model55: Accuracy has increased by 0.03023 percent.
    • model32: Accuracy has increased by 0.0167 percent.
    • model35: Accuracy has increased by 0.03342 percent.
# Aggregating summary statistics by model and correct/incorrect
agg_df <- df %>% 
  group_by(model, correct) %>% 
  summarise(p=n()/10000,
          mean_uncertainty=mean(uncertainty),
          sd_uncertainty=sd(uncertainty), 
          mean_diff=mean(diff),
          sd_diff=sd(diff),
          mean_ratio=mean(diff_sd_ratio),
          sd_ratio=sd(diff_sd_ratio))

# Sorting by uncertainty: most certain to least certain
ordered_df <- agg_df %>% 
  dplyr::arrange(mean_uncertainty)
ordered_df

1.1 Visualisations

# Distributions of uncertainty by model (left) and by model/classification (right)
p1 <- df %>% 
  ggplot(aes(x=uncertainty)) + 
  geom_histogram(alpha=.5, fill="blue", bins=50) +
  facet_grid(as.factor(model)~.) +
  ggtitle("Distributions of uncertainty")

p2 <- df %>% 
  ggplot(aes(x=uncertainty, col=correct)) +
  geom_freqpoly(alpha=.7) +
  ggtitle("Distribution of estimated uncertainty by classification") +
  scale_color_discrete(name="Prediction",
                         breaks=c("0", "1"),
                         labels=c("0: Incorrect", "1: Correct")) +
  facet_grid(as.factor(model)~.)

layout <- matrix(c(1,2), nrow=1)
multiplot(p1, p2, layout=layout)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# KDEs
p3 <- df %>% 
  ggplot(aes(x=uncertainty)) + 
  geom_density(aes(fill=correct), alpha=.7) +
  facet_grid(as.factor(model)~correct) +
  ggtitle("Kernel density estimations of uncertainty by classification")
p3

# Boxplots of uncertainty by model and classification
df %>% 
  ggplot(aes(x=correct, y=uncertainty)) +
  geom_boxplot(aes(fill=correct), alpha=.7) +
  facet_grid(.~as.factor(model))

# Boxplots of log-transformed tau by model and classification
df %>% 
  ggplot(aes(x=correct, y=log(diff_sd_ratio))) +
  geom_boxplot(aes(fill=correct), alpha=.7) +
  facet_grid(.~as.factor(model))

# Plotting uncertainty against softmax of predicted class, with density estimation contours
df %>% 
  ggplot(aes(x=uncertainty, y=prob1, col=prob1)) +
  geom_point(alpha=.5) +
  geom_density2d(col="black", alpha=.5) +
  labs(y="softmax output of predicted class") +
  scale_color_distiller(name="",
                        palette = "Spectral") +
  facet_grid(~as.factor(model)~correct)

# Plotting log-transformed tau against uncertainty
df %>% 
  ggplot(aes(x=uncertainty, y=log(diff_sd_ratio), col=log(diff_sd_ratio))) +
  geom_point(alpha=.2) +
  #geom_density2d(col="black", alpha=.5) +
  labs(y="log(tau)") +
  facet_grid(~as.factor(model)~correct)

3. References

[1] Gal, Y. & Ghahramani, Z. Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference. arXiv: 1506.02158 (2015).

[2] Gal, Y. & Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. arXiv: 1506.02142 (2015).

[3] Gal, Y. & Ghahramani, Z. Dropout as a Bayesian Approximation: Appendix. arXiv: 1506.02157 (2015).

[4] Leibig, C. & Allken, V. et. al. Leveraging uncertainty information from deep neural networks for disease detection. Scientific Reports volume 7, Article number: 17816 (2017).

[5] Hinton, G. & Srivastava, N. et. al. Improving neural networks by preventing co-adaptation of feature detectors. arXiv: 1207.0580 (2012).

[6] Leibig, C. & Allken, V. et. al. Leveraging uncertainty information from deep neural networks for disease detection: supplementary material. Scientific Reports volume 7, Article number: 17816 (2017).